1
00:00:04,230 --> 00:00:11,230
[Music]

2
00:00:15,289 --> 00:00:13,669
my name is Zach and I'm really excited

3
00:00:17,990 --> 00:00:15,299
today to present on some of the work

4
00:00:20,510 --> 00:00:18,000
I've done so far during my PhD which is

5
00:00:21,950 --> 00:00:20,520
mostly looking into machine learning of

6
00:00:23,990 --> 00:00:21,960
the chemical inventory and rare

7
00:00:29,269 --> 00:00:24,000
isotopologues of the protostellar source

8
00:00:31,310 --> 00:00:29,279
IRAS excuse me 16293-2422 B

9
00:00:33,770 --> 00:00:31,320
so first of all let me give just a very

10
00:00:35,750 --> 00:00:33,780
high level overview of what supervised

11
00:00:39,049 --> 00:00:35,760
machine learning for regression is so

12
00:00:41,990 --> 00:00:39,059
the overall goal in this process is to

13
00:00:44,330 --> 00:00:42,000
learn a set of model parameters that can

14
00:00:46,729 --> 00:00:44,340
be best used to map some input features

15
00:00:49,250 --> 00:00:46,739
some X values to some relevant

16
00:00:51,650 --> 00:00:49,260
properties some y values just the

17
00:00:54,170 --> 00:00:51,660
simplest and oversimplified possible

18
00:00:55,610 --> 00:00:54,180
example of this process is just 2D

19
00:00:58,610 --> 00:00:55,620
linear regression something that could

20
00:01:00,410 --> 00:00:58,620
be modeled by y = mx + b and

21
00:01:03,170 --> 00:01:00,420
what you do in this process is you

22
00:01:06,649 --> 00:01:03,180
provide the model with training data in

23
00:01:09,050 --> 00:01:06,659
the form of (x, y) pairs and what it does

24
00:01:11,270 --> 00:01:09,060
is it learns the model parameters in

25
00:01:13,609 --> 00:01:11,280
this case the m and the b that can best

26
00:01:15,950 --> 00:01:13,619
map those inputs to those outputs in a

27
00:01:18,469 --> 00:01:15,960
way that minimizes some sort of loss

28
00:01:20,149 --> 00:01:18,479
function and while this is a very

29
00:01:22,429 --> 00:01:20,159
oversimplified view of this whole

30
00:01:24,410 --> 00:01:22,439
approach it can also be very effective

31
00:01:27,050 --> 00:01:24,420
for high dimensional inputs and also

32
00:01:29,210 --> 00:01:27,060
include very complex non-linear models

33
00:01:30,830 --> 00:01:29,220
as well so
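A minimal sketch of the y = mx + b example described here, assuming ordinary least squares as the loss being minimized (the talk does not name a specific loss). The data points and the function name `fit_line` are made up for illustration.

```python
def fit_line(xs, ys):
    """Fit y = m*x + b by ordinary least squares (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # The fitted line passes through the point of means.
    b = mean_y - m * mean_x
    return m, b


# Made-up, noise-free training pairs generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

m, b = fit_line(xs, ys)
print(f"m = {m}, b = {b}")
```

The learned parameters here are just the two numbers m and b; the higher-dimensional models mentioned in the talk follow the same pattern with more parameters.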

34
00:01:34,249 --> 00:01:30,840
now getting into the problem that I'm

35
00:01:35,810 --> 00:01:34,259
trying to tackle with this approach so

36
00:01:38,090 --> 00:01:35,820
the overarching goal behind most of the

37
00:01:40,190 --> 00:01:38,100
work I've done so far is suggesting

38
00:01:42,830 --> 00:01:40,200
likely molecular candidates for

39
00:01:44,390 --> 00:01:42,840
detection in various regions of space so

40
00:01:46,190 --> 00:01:44,400
let's first take a step back and ask the

41
00:01:48,770 --> 00:01:46,200
question why do we even care about

42
00:01:50,389 --> 00:01:48,780
interstellar molecules so because we're

43
00:01:52,010 --> 00:01:50,399
looking all the way out in space we're

44
00:01:54,350 --> 00:01:52,020
not fortunate to be able to throw a

45
00:01:55,850 --> 00:01:54,360
thermometer in a reaction beaker and easily

46
00:01:58,550 --> 00:01:55,860
observe some sort of physical change

47
00:02:00,109 --> 00:01:58,560
during a reaction instead in order to

48
00:02:01,789 --> 00:02:00,119
trace the physical properties and

49
00:02:03,889 --> 00:02:01,799
evolutionary history of interstellar

50
00:02:06,410 --> 00:02:03,899
sources we oftentimes rely on the

51
00:02:08,389 --> 00:02:06,420
molecules that we detect along with the

52
00:02:11,150 --> 00:02:08,399
properties of said molecules and just a

53
00:02:13,330 --> 00:02:11,160
couple of examples D to H ratios provide

54
00:02:15,949 --> 00:02:13,340
information about the temperatures

55
00:02:18,530 --> 00:02:15,959
of the environments during molecular

56
00:02:20,930 --> 00:02:18,540
formation and the detection of SiO can

57
00:02:23,270 --> 00:02:20,940
trace things like stellar outflows or

58
00:02:25,369 --> 00:02:23,280
shocks so in the past

59
00:02:26,990 --> 00:02:25,379
if we wanted to model molecular

60
00:02:28,430 --> 00:02:27,000
abundances and use those to predict new

61
00:02:30,710 --> 00:02:28,440
molecules for detection we relied

62
00:02:32,570 --> 00:02:30,720
heavily on astrochemical models and

63
00:02:34,250 --> 00:02:32,580
these as can be seen on the screen are

64
00:02:36,589 --> 00:02:34,260
vast networks of interconnected

65
00:02:38,750 --> 00:02:36,599
reactions each reaction having its own

66
00:02:41,509 --> 00:02:38,760
rate constant and ultimately they can be

67
00:02:44,630 --> 00:02:41,519
used to derive the time-dependent

68
00:02:46,729 --> 00:02:44,640
molecular abundances of these species

69
00:02:48,350 --> 00:02:46,739
and while these are excellent tools to

70
00:02:50,089 --> 00:02:48,360
gauge our current understanding of the

71
00:02:52,910 --> 00:02:50,099
specific chemical processes in space

72
00:02:55,550 --> 00:02:52,920
there's a couple of drawbacks first of

73
00:02:57,170 --> 00:02:55,560
all in order to add a new molecule or

74
00:02:59,570 --> 00:02:57,180
reaction to the network we oftentimes

75
00:03:01,150 --> 00:02:59,580
rely very heavily on chemical intuition

76
00:03:03,890 --> 00:03:01,160
additionally the

77
00:03:05,990 --> 00:03:03,900
rate constants that are inputted into

78
00:03:08,690 --> 00:03:06,000
the networks are oftentimes either

79
00:03:10,130 --> 00:03:08,700
extrapolated or approximated and the

80
00:03:11,809 --> 00:03:10,140
uncertainty that comes along with this

81
00:03:13,130 --> 00:03:11,819
when you propagate it through the whole

82
00:03:15,410 --> 00:03:13,140
network can result in some very

83
00:03:17,330 --> 00:03:15,420
uncertain or inaccurate modeled

84
00:03:19,610 --> 00:03:17,340
abundances and finally it's just

85
00:03:20,990 --> 00:03:19,620
difficult to include new molecules in

86
00:03:22,910 --> 00:03:21,000
order to add a new molecule to the

87
00:03:24,949 --> 00:03:22,920
network you have to include every single

88
00:03:26,390 --> 00:03:24,959
reaction that could create the molecule

89
00:03:28,190 --> 00:03:26,400
as well as every one that could

90
00:03:31,369 --> 00:03:28,200
subsequently destroy it so that's a

91
00:03:33,830 --> 00:03:31,379
difficult and oftentimes inefficient

92
00:03:35,509 --> 00:03:33,840
process so in response to these

93
00:03:37,490 --> 00:03:35,519
predictive shortcomings a previous

94
00:03:39,410 --> 00:03:37,500
postdoc in our group Dr Kelvin Lee

95
00:03:41,149 --> 00:03:39,420
developed a machine learning method

96
00:03:43,250 --> 00:03:41,159
that's able to predict and model

97
00:03:45,830 --> 00:03:43,260
molecular abundances in space without

98
00:03:47,570 --> 00:03:45,840
requiring these complete networks and

99
00:03:49,550 --> 00:03:47,580
instead molecular abundances are

100
00:03:52,490 --> 00:03:49,560
expressed purely in terms of a chemical

101
00:03:55,070 --> 00:03:52,500
vector space so in this process the

102
00:03:57,110 --> 00:03:55,080
first step is you need to collect

103
00:03:59,869 --> 00:03:57,120
telescope data toward a specific

104
00:04:02,330 --> 00:03:59,879
interstellar source from this line

105
00:04:04,850 --> 00:04:02,340
survey you'll be able to decipher which

106
00:04:07,309 --> 00:04:04,860
molecules are present along with the

107
00:04:11,149 --> 00:04:07,319
abundances or column densities of said

108
00:04:12,949 --> 00:04:11,159
molecules following this for any machine

109
00:04:14,809 --> 00:04:12,959
learning application you need to

110
00:04:17,090 --> 00:04:14,819
vectorize your input so we have to make

111
00:04:20,210 --> 00:04:17,100
molecular feature vectors out of the

112
00:04:22,610 --> 00:04:20,220
molecules we're detecting to do this we

113
00:04:24,650 --> 00:04:22,620
utilize the Mol2vec algorithm which is

114
00:04:26,870 --> 00:04:24,660
an unsupervised algorithm that creates

115
00:04:28,969 --> 00:04:26,880
context-aware substructure vector

116
00:04:31,610 --> 00:04:28,979
representations that can be subsequently

117
00:04:34,070 --> 00:04:31,620
summed to form molecular feature vectors
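A toy sketch of the summation step, under stated assumptions: real Mol2vec embeddings are learned from a large corpus and keyed by Morgan substructure identifiers, whereas the three-dimensional vectors and the group names below are hypothetical placeholders chosen only to show the arithmetic.

```python
# Hypothetical substructure embeddings (illustrative values, not learned ones).
substructure_vectors = {
    "CH3": [0.2, -0.1, 0.5],
    "OH": [-0.3, 0.4, 0.1],
    "O": [-0.2, 0.3, 0.0],
}


def molecule_vector(substructures):
    """Sum the embedding of each substructure to get one molecular feature vector."""
    dim = len(next(iter(substructure_vectors.values())))
    total = [0.0] * dim
    for name in substructures:
        for i, value in enumerate(substructure_vectors[name]):
            total[i] += value
    return total


# Toy decomposition of methanol into two placeholder substructures.
methanol = molecule_vector(["CH3", "OH"])
print(methanol)
```

Because the result is a plain sum, every molecule ends up with a vector of the same length regardless of how many substructures it contains, which is what a regression model needs as input.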

118
00:04:36,710 --> 00:04:34,080
so at this point we have our molecular

119
00:04:38,570 --> 00:04:36,720
feature vectors our inputs as well as

120
00:04:40,550 --> 00:04:38,580
our relevant column densities our

121
00:04:43,730 --> 00:04:40,560
outputs and what we do as mentioned

122
00:04:45,890 --> 00:04:43,740
previously we input this into a machine

123
00:04:48,590 --> 00:04:45,900
learning model that learns the best way

124
00:04:50,390 --> 00:04:48,600
the model parameters to map those

125
00:04:52,969 --> 00:04:50,400
molecular features to the relevant

126
00:04:55,790 --> 00:04:52,979
column densities and this is just a

127
00:04:58,550 --> 00:04:55,800
figure from the initial paper and what

128
00:05:00,770 --> 00:04:58,560
it shows is that a very simple

129
00:05:02,810 --> 00:05:00,780
regularized linear regression machine

130
00:05:05,270 --> 00:05:02,820
learning method a ridge regression model

131
00:05:06,830 --> 00:05:05,280
is able to far outcompete even the

132
00:05:10,430 --> 00:05:06,840
state-of-the-art GOTHAM Nautilus

133
00:05:13,430 --> 00:05:10,440
astrochemical model in reproducing and

134
00:05:16,969 --> 00:05:13,440
predicting the chemical abundances in

135
00:05:18,830 --> 00:05:16,979
the TMC-1 dark molecular cloud so while

136
00:05:21,530 --> 00:05:18,840
Kelvin's initial work was a fantastic

137
00:05:23,270 --> 00:05:21,540
proof of concept that this method can in

138
00:05:25,610 --> 00:05:23,280
fact effectively model and predict

139
00:05:27,650 --> 00:05:25,620
molecular abundances in space there's a

140
00:05:30,469 --> 00:05:27,660
number of things that just remain simply

141
00:05:32,930 --> 00:05:30,479
untested first of all untested outside

142
00:05:35,029 --> 00:05:32,940
of dark molecular clouds so the initial

143
00:05:37,070 --> 00:05:35,039
work was focused on the TMC-1 dark

144
00:05:39,830 --> 00:05:37,080
molecular cloud chemical inventory and

145
00:05:42,230 --> 00:05:39,840
this is a very cold and quiescent region

146
00:05:43,790 --> 00:05:42,240
of interstellar space so we also want to

147
00:05:46,909 --> 00:05:43,800
ensure that these same methods can also

148
00:05:49,070 --> 00:05:46,919
apply to warmer more turbulent protostellar

149
00:05:52,129 --> 00:05:49,080
sources

150
00:05:55,790 --> 00:05:52,139
and for this we looked at the class 0

151
00:05:57,170 --> 00:05:55,800
protostellar binary IRAS 16293 B this is

152
00:05:59,090 --> 00:05:57,180
an especially attractive source because

153
00:06:00,710 --> 00:05:59,100
it has a very dense molecular line

154
00:06:03,050 --> 00:06:00,720
survey and it's been studied extensively

155
00:06:05,210 --> 00:06:03,060
with interferometric data it's also

156
00:06:07,129 --> 00:06:05,220
vital that we can model the abundances

157
00:06:09,350 --> 00:06:07,139
in both these cold dark clouds and the

158
00:06:11,450 --> 00:06:09,360
warmer protostellar sources because

159
00:06:13,550 --> 00:06:11,460
understanding the chemical inventories

160
00:06:14,930 --> 00:06:13,560
of these two different sources allows us

161
00:06:17,870 --> 00:06:14,940
to investigate how the chemistry

162
00:06:20,270 --> 00:06:17,880
actually evolves as a star is forming

163
00:06:22,129 --> 00:06:20,280
additionally in part due to the

164
00:06:24,650 --> 00:06:22,139
shortcomings of the Mol2vec algorithm

165
00:06:27,830 --> 00:06:24,660
there were initially no isotopologues

166
00:06:31,610 --> 00:06:27,840
included in the data set however IRAS

167
00:06:34,070 --> 00:06:31,620
16293 B consistently shows very high levels

168
00:06:36,050 --> 00:06:34,080
of isotopic substitution as a result in

169
00:06:37,909 --> 00:06:36,060
order to fill out the data set we felt

170
00:06:40,629 --> 00:06:37,919
the need to include these molecules

171
00:06:43,370 --> 00:06:40,639
additionally as mentioned previously

172
00:06:45,170 --> 00:06:43,380
isotopologues provide information about

173
00:06:46,790 --> 00:06:45,180
the temperatures and time scales of

174
00:06:48,770 --> 00:06:46,800
molecular formation in space and

175
00:06:50,930 --> 00:06:48,780
therefore being able to model these

176
00:06:52,430 --> 00:06:50,940
ratios effectively with this machine

177
00:06:54,110 --> 00:06:52,440
learning method would provide a

178
00:06:57,350 --> 00:06:54,120
straightforward and efficient way to

179
00:06:59,990 --> 00:06:57,360
gain additional astrochemical insight so

180
00:07:02,090 --> 00:07:00,000
in order to include these isotopologues

181
00:07:04,150 --> 00:07:02,100
we added hand-picked isotopologue

182
00:07:06,650 --> 00:07:04,160
descriptors at the end of our Mol2vec

183
00:07:09,050 --> 00:07:06,660
representations and more specifically we

184
00:07:12,050 --> 00:07:09,060
added 19 extra vector dimensions that

185
00:07:14,570 --> 00:07:12,060
denoted which specific minor isotopes

186
00:07:16,370 --> 00:07:14,580
are substituted into the molecule along

187
00:07:19,070 --> 00:07:16,380
with the chemical environment of said

188
00:07:21,050 --> 00:07:19,080
isotopic substitution so just as an

189
00:07:23,749 --> 00:07:21,060
example three of the vector dimensions

190
00:07:26,330 --> 00:07:23,759
denote whether the 13C carbon is sp sp2

191
00:07:28,189 --> 00:07:26,340
or sp3 hybridized and we chose this

192
00:07:30,710 --> 00:07:28,199
feature because as you can see it has a

193
00:07:34,309 --> 00:07:30,720
notable impact on the mean 12C to 13C

194
00:07:36,230 --> 00:07:34,319
ratio of the molecules in this source so
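A minimal sketch of how such descriptors could be appended, covering only the three hybridization dimensions named here. It assumes they are simple 0/1 flags; the talk does not say how the values are encoded, and the remaining 16 dimensions are not described, so they are left out. The function name and the base vector are hypothetical.

```python
HYBRIDIZATIONS = ("sp", "sp2", "sp3")


def add_13c_descriptor(base_vector, hybridization=None):
    """Append three flags marking the hybridization of a 13C-substituted carbon.

    hybridization=None means the molecule carries no 13C substitution,
    so all three appended slots stay at zero.
    """
    flags = [1.0 if h == hybridization else 0.0 for h in HYBRIDIZATIONS]
    return list(base_vector) + flags


base = [0.1, 0.2]  # placeholder for a much longer molecular feature vector
main_isotopologue = add_13c_descriptor(base)
sp2_substituted = add_13c_descriptor(base, "sp2")
print(main_isotopologue)
print(sp2_substituted)
```

The main isotopologue and its 13C-substituted form share the same leading dimensions and differ only in the appended slots, which gives a regression model something to distinguish them by.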

195
00:07:37,730 --> 00:07:36,240
now getting into some results we trained

196
00:07:40,189 --> 00:07:37,740
both a Gaussian process regression and

197
00:07:41,629 --> 00:07:40,199
Bayesian ridge regression model to map

198
00:07:44,089 --> 00:07:41,639
the molecular features to the column

199
00:07:46,670 --> 00:07:44,099
densities and what we're ultimately able

200
00:07:50,029 --> 00:07:46,680
to see is that the models are able to

201
00:07:52,490 --> 00:07:50,039
both effectively model the molecules

202
00:07:54,830 --> 00:07:52,500
provided to it in the training set but

203
00:07:57,890 --> 00:07:54,840
also extrapolate quite well to yet

204
00:08:00,409 --> 00:07:57,900
unseen molecules in the testing set
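A rough sketch of the train-then-test idea, assuming NumPy is available. It uses plain closed-form ridge regression as a stand-in for the Bayesian ridge and Gaussian process models mentioned in the talk, and the feature matrix, weights, and targets are synthetic rather than real molecular data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 60 "molecules" with 5 features each and a noisy linear target.
X = rng.normal(size=(60, 5))
true_w = np.array([1.5, -2.0, 0.5, 0.0, 1.0])
y = X @ true_w + rng.normal(scale=0.1, size=60)

# Hold out the last 15 rows as a testing set that is never used for fitting.
X_train, X_test = X[:45], X[45:]
y_train, y_test = y[:45], y[45:]


def fit_ridge(X, y, alpha):
    """Closed-form ridge solution w = (X^T X + alpha*I)^-1 X^T y, with no intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


def rmse(X, y, w):
    """Root-mean-square error of the linear prediction X @ w against y."""
    return float(np.sqrt(np.mean((X @ w - y) ** 2)))


w = fit_ridge(X_train, y_train, alpha=1.0)
train_rmse = rmse(X_train, y_train, w)
test_rmse = rmse(X_test, y_test, w)
print(f"train RMSE = {train_rmse:.3f}, test RMSE = {test_rmse:.3f}")
```

Comparing the two errors is the check described here: a testing error close to the training error suggests the model generalizes to inputs it has not seen.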

205
00:08:02,029 --> 00:08:00,419
additionally because we included isotopologues

206
00:08:04,330 --> 00:08:02,039
in our data set we wanted to see

207
00:08:06,650 --> 00:08:04,340
how well these models were able to

208
00:08:08,749 --> 00:08:06,660
reproduce the column densities and

209
00:08:11,270 --> 00:08:08,759
isotopic ratios of the molecules in the

210
00:08:14,270 --> 00:08:11,280
source so what you can see on the top

211
00:08:16,670 --> 00:08:14,280
row of the deuterium and 13C substituted

212
00:08:18,890 --> 00:08:16,680
isotopologues using five-fold

213
00:08:21,589 --> 00:08:18,900
cross-validation the column densities are

214
00:08:23,930 --> 00:08:21,599
very accurately modeled once you

215
00:08:25,610 --> 00:08:23,940
extrapolate this out to actual isotopic

216
00:08:27,830 --> 00:08:25,620
ratio predictions these are much more

217
00:08:31,490 --> 00:08:27,840
sensitive to small changes in column

218
00:08:33,110 --> 00:08:31,500
densities as a result a small error in

219
00:08:35,870 --> 00:08:33,120
the column density prediction can result

220
00:08:38,870 --> 00:08:35,880
in a large isotopic ratio error so the

221
00:08:41,570 --> 00:08:38,880
bottom row of actual isotopic ratios is

222
00:08:44,089 --> 00:08:41,580
slightly less precise however just

223
00:08:46,010 --> 00:08:44,099
because of how nuanced the process of

224
00:08:48,350 --> 00:08:46,020
isotopic fractionation is in space and

225
00:08:50,030 --> 00:08:48,360
how simple our encoding is we're very

226
00:08:52,550 --> 00:08:50,040
encouraged by these results that we're

227
00:08:53,990 --> 00:08:52,560
able to very accurately model the column

228
00:08:56,389 --> 00:08:54,000
densities of these isotopically

229
00:08:58,790 --> 00:08:56,399
substituted species
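A small worked example of why the ratios are more error-sensitive than the column densities. It assumes the models predict log10 column densities (an assumption, not stated in the talk), and the numerical values are hypothetical. Under that assumption, an error of 0.1 dex on each species, in opposite directions, shifts the ratio by a factor of 10^0.2, about 1.58.

```python
def ratio_from_logs(log_n_main, log_n_minor):
    """Isotopic ratio N_main / N_minor computed from log10 column densities."""
    return 10 ** (log_n_main - log_n_minor)


# Hypothetical "true" log10 column densities for a main and a minor isotopologue.
true_ratio = ratio_from_logs(16.0, 14.2)

# Each prediction is off by only 0.1 dex, but in opposite directions.
predicted_ratio = ratio_from_logs(16.1, 14.1)

# The two small errors add in log space, so the ratio is off by 10**0.2.
error_factor = predicted_ratio / true_ratio
print(f"ratio is off by a factor of {error_factor:.2f}")
```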

230
00:09:00,949 --> 00:08:58,800
so as mentioned previously due to the

231
00:09:02,449 --> 00:09:00,959
strong performance on the testing set we

232
00:09:04,870 --> 00:09:02,459
have some sort of confidence that these

233
00:09:08,449 --> 00:09:04,880
models can extrapolate to yet unseen

234
00:09:10,850 --> 00:09:08,459
species and as a result we proceeded to

235
00:09:12,530 --> 00:09:10,860
input about 90 000 astrochemically

236
00:09:14,870 --> 00:09:12,540
relevant molecules into the trained

237
00:09:16,610 --> 00:09:14,880
models to see which undetected species

238
00:09:18,350 --> 00:09:16,620
are likely the most abundant in this

239
00:09:20,690 --> 00:09:18,360
source and on the bar chart on the

240
00:09:24,350 --> 00:09:20,700
screen you can see the top 10 predicted

241
00:09:25,910 --> 00:09:24,360
abundance molecules toward IRAS 16293

242
00:09:27,590 --> 00:09:25,920
B and there's two things to point out

243
00:09:29,509 --> 00:09:27,600
here first of all three of these

244
00:09:31,430 --> 00:09:29,519
molecules namely hydrogen peroxide

245
00:09:33,050 --> 00:09:31,440
ethane and carbon dioxide have all been

246
00:09:35,570 --> 00:09:33,060
previously detected in different regions

247
00:09:38,210 --> 00:09:35,580
of space additionally something you may

248
00:09:41,329 --> 00:09:38,220
notice in the bar chart is that many of

249
00:09:44,030 --> 00:09:41,339
these molecules are very oxygenated and

250
00:09:46,550 --> 00:09:44,040
fairly saturated hydrocarbons and this

251
00:09:48,290 --> 00:09:46,560
is also a good sign because when looking

252
00:09:50,509 --> 00:09:48,300
at the actual chemical inventory of

253
00:09:52,670 --> 00:09:50,519
these sources or this specific source

254
00:09:55,250 --> 00:09:52,680
sorry the most abundant detected

255
00:09:58,130 --> 00:09:55,260
molecules are also these very oxygenated

256
00:10:00,650 --> 00:09:58,140
hydrocarbons so not only is it learning

257
00:10:02,930 --> 00:10:00,660
to predict known interstellar molecules

258
00:10:04,610 --> 00:10:02,940
but at the same time it's narrowing down

259
00:10:06,889 --> 00:10:04,620
to the correct region of chemical space

260
00:10:09,530 --> 00:10:06,899
relevant to this source
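The screening step can be sketched as a filter followed by a sort. The candidate names and predicted values below are placeholders, not results from the talk, and `top_undetected` is a hypothetical helper name.

```python
# Placeholder predictions: candidate name -> predicted log10 column density.
predicted_log_n = {
    "candidate_A": 14.1,
    "candidate_B": 15.7,
    "candidate_C": 13.2,
    "candidate_D": 15.0,
}

# Species already detected in the source are excluded from the ranking.
detected = {"candidate_D"}


def top_undetected(predictions, detected, k):
    """Return the k undetected candidates with the highest predicted abundance."""
    undetected = {name: value for name, value in predictions.items()
                  if name not in detected}
    return sorted(undetected, key=undetected.get, reverse=True)[:k]


top = top_undetected(predicted_log_n, detected, k=2)
print(top)
```

With roughly 90 000 candidates the same pattern applies; only the size of the dictionary and the value of k change.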

261
00:10:11,570 --> 00:10:09,540
so as I mentioned these 10 molecules

262
00:10:13,610 --> 00:10:11,580
have not been previously detected in

263
00:10:15,710 --> 00:10:13,620
this source and the reason for that in

264
00:10:17,630 --> 00:10:15,720
many cases is just simply a lack of

265
00:10:21,410 --> 00:10:17,640
rotational spectra being taken in the

266
00:10:23,030 --> 00:10:21,420
lab so now next steps is we want to take

267
00:10:25,310 --> 00:10:23,040
that next step forward and collect the

268
00:10:27,050 --> 00:10:25,320
rotational spectra of some of these

269
00:10:28,250 --> 00:10:27,060
predicted high abundance molecules so

270
00:10:29,990 --> 00:10:28,260
that they can be searched for in these

271
00:10:32,530 --> 00:10:30,000
protostellar sources one of particular

272
00:10:35,090 --> 00:10:32,540
interest is circled on the screen

273
00:10:36,949 --> 00:10:35,100
methoxyethanol and methoxyethanol is in

274
00:10:38,930 --> 00:10:36,959
the same chemical family as both

275
00:10:40,850 --> 00:10:38,940
methoxymethanol and methoxyethane which have

276
00:10:44,870 --> 00:10:40,860
been detected in high abundance toward

277
00:10:47,030 --> 00:10:44,880
IRAS 16293 B but not only is this

278
00:10:48,350 --> 00:10:47,040
molecule chemically similar to several

279
00:10:49,850 --> 00:10:48,360
that have been seen before but we also

280
00:10:51,949 --> 00:10:49,860
have some sort of mechanistic reason to

281
00:10:54,710 --> 00:10:51,959
believe it may be present so

282
00:10:57,650 --> 00:10:54,720
methoxymethanol has been shown to form via

283
00:11:01,790 --> 00:10:57,660
reaction of the CH3O the methoxy radical

284
00:11:04,130 --> 00:11:01,800
with CH2OH on grain surfaces so the high

285
00:11:06,290 --> 00:11:04,140
abundance of methoxymethanol also

286
00:11:08,630 --> 00:11:06,300
suggests that in the pre-stellar phase

287
00:11:10,370 --> 00:11:08,640
of the source that the methoxy radical

288
00:11:12,530 --> 00:11:10,380
was highly abundant on these grain

289
00:11:14,990 --> 00:11:12,540
surfaces as a result it could feasibly

290
00:11:17,150 --> 00:11:15,000
react with the other high abundance

291
00:11:18,590 --> 00:11:17,160
organic radicals in the source and we

292
00:11:20,329 --> 00:11:18,600
therefore believe that the methoxylated

293
00:11:21,949 --> 00:11:20,339
versions of these high abundance

294
00:11:24,710 --> 00:11:21,959
organics in the source may be strong

295
00:11:27,110 --> 00:11:24,720
targets for astrochemical study

296
00:11:28,910 --> 00:11:27,120
so the next step is to use chirped-pulse

297
00:11:30,530 --> 00:11:28,920
Fourier transform microwave spectroscopy

298
00:11:32,930 --> 00:11:30,540
to study the rotational spectrum of this

299
00:11:35,449 --> 00:11:32,940
molecule subsequently use the laboratory

300
00:11:37,610 --> 00:11:35,459
spectrum to search for this molecule in

301
00:11:39,910 --> 00:11:37,620
various protostellar sources and upon its

302
00:11:44,990 --> 00:11:39,920
detection learn more about the chemistry

303
00:11:46,670 --> 00:11:45,000
of this highly abundant methoxy radical

304
00:11:48,110 --> 00:11:46,680
so that's all the work I've done so far

305
00:11:50,509 --> 00:11:48,120
as well as what I'm working towards I'd

306
00:11:51,889 --> 00:11:50,519
like to say a big thank you to my group

307
00:11:53,329 --> 00:11:51,899
shown on the screen the picture on the

308
00:11:55,310 --> 00:11:53,339
right is us in the Green Bank Telescope

309
00:11:57,170 --> 00:11:55,320
which is a very cool experience I

310
00:11:58,790 --> 00:11:57,180
definitely recommend if you have the

311
00:12:00,470 --> 00:11:58,800
opportunity to travel there but thank

312
00:12:08,329 --> 00:12:00,480
you for listening and I'd be happy to